Lior Blum, Dan Caspi
The main goal of our project is to demonstrate a music recommendation system based on audio features and lyrics analysis of tracks.
The system should use machine learning techniques to predict which songs each user would probably like.
Such systems are mainly used in music streaming platforms (such as Spotify, Apple Music, YouTube Music...). Music recommendation systems have various benefits:
The two most popular types of recommendation systems are:
The content-based approach relies on the similarity of particular items. While using a streaming music service, a user likes or dislikes songs, creates playlists, or defines his/her favorite songs/genres/artists beforehand. The main idea of a content-based recommendation system is to extract metadata and/or data of songs that a user loved, compare them with the metadata/data of other songs in the service's library, and based on this, recommend similar songs to the user.
In turn, a collaborative system is built on the basis of users’ overlapping preferences and ratings of songs. It assumes that if user A and user B express similar preferences, similar songs can be recommended to them, meaning that if user A likes a particular song, it is likely that this song will also appeal to user B, and vice versa.
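As a toy sketch of this idea (the ratings, song names and helper functions here are all made up for illustration - they are not part of our pipeline), user-based collaborative filtering boils down to:

```python
import math

# Made-up "liked songs" sets for three users (1 = liked)
ratings = {
    'A': {'song1': 1, 'song2': 1, 'song3': 1},
    'B': {'song1': 1, 'song2': 1, 'song4': 1},
    'C': {'song5': 1},
}

def cosine_sim(u, v):
    """Cosine similarity between two sparse rating dicts."""
    dot = sum(u[s] * v[s] for s in set(u) & set(v))
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def recommend(user, ratings):
    """Suggest songs liked by the most similar other user but not yet by `user`."""
    best = max((other for other in ratings if other != user),
               key=lambda other: cosine_sim(ratings[user], ratings[other]))
    return sorted(set(ratings[best]) - set(ratings[user]))

recommend('A', ratings)  # A and B overlap on song1/song2, so B's song4 is suggested
```

Real systems aggregate over many weighted neighbors (or factorize the whole user-item matrix) rather than copying from the single most similar user, but the principle is the same.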
Collaborative filtering is widely used not only in music services, but also in shopping, video streaming, and social networks. Its assumption - that people who agreed in the past will agree in the future - holds true on many occasions in music too. To our knowledge, all major music streaming services nowadays use some kind of collaborative filtering as part of their recommendations. However, it has some major drawbacks (in general and specifically in music):
Cold Start - For a new user, there isn't enough data to make accurate recommendations. For a new song/artist, there are not enough people that have listened to the song, so it will not be recommended.
Popularity Bias - Recommendations are biased towards popular items, limiting exposure of less popular, underrated and indie songs and artists (not only newcomers, but also "unconventional" ones).
Cultural Barrier - Music is an international language. However, with this method users are almost exclusively recommended songs in their own language and from their own country (with American/English music as the only exception). 'Similar users' in the collaborative filtering technique primarily means users from the same region, mainly due to their initial playlists or region & language settings, which might cause users to miss songs they could have liked, only because they are of foreign origin.
Therefore, we decided to emphasize content-based filtering, which helps tackle the issues explained above. In the music field, the 'content' we'll use consists of:
A recommendation system based on this kind of content does not suffer as much from cold start:
Content-based filtering also does not discriminate against unpopular songs/artists or foreign music, since it relies purely on the essence of music - melody and lyrics. Therefore, it encourages users to broaden their horizons - musically and culturally.
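The content-based counterpart can be sketched the same way, with hypothetical per-song feature vectors (the names and values below are invented; our real features come from the audio and lyrics analysis later in this notebook):

```python
import math

# Hypothetical feature vectors (e.g. tempo, energy, valence), scaled to 0-1
features = {
    'liked_song':  [0.80, 0.90, 0.70],
    'candidate_a': [0.75, 0.85, 0.72],  # close to the liked song in feature space
    'candidate_b': [0.10, 0.20, 0.95],  # far from it
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Rank candidates by similarity to a song the user liked
ranked = sorted((s for s in features if s != 'liked_song'),
                key=lambda s: cosine(features['liked_song'], features[s]),
                reverse=True)
```

Note that the ranking never looks at other users, which is exactly why popularity and region play no role here.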
The data used in this project is taken from the "Spotify Million Playlist Dataset Challenge" - a continuation of a data science research challenge focused on music recommendation, organized by Spotify (see RecSys Challenge 2018).
The project's data consists of:
We'll go over each of them and demonstrate what they look like.
import json
# Show format of one playlist and one track
with open('data/spotify_million_playlist_dataset/mpd.slice.0-999.json') as f:
    ex_playlist = json.load(f)['playlists'][0]
ex_playlist['tracks'] = [ex_playlist['tracks'][0]]
ex_playlist
{'name': 'Throwbacks',
'collaborative': 'false',
'pid': 0,
'modified_at': 1493424000,
'num_tracks': 52,
'num_albums': 47,
'num_followers': 1,
'tracks': [{'pos': 0,
'artist_name': 'Missy Elliott',
'track_uri': 'spotify:track:0UaMYEvWZi0ZqiDOoHU3YI',
'artist_uri': 'spotify:artist:2wIVse2owClT7go1WT98tk',
'track_name': 'Lose Control (feat. Ciara & Fat Man Scoop)',
'album_uri': 'spotify:album:6vV5UrXcfyQD1wu4Qo2I9K',
'duration_ms': 226863,
'album_name': 'The Cookbook'}],
'num_edits': 6,
'duration_ms': 11532414,
'num_artists': 37}
import json
import os
all_songs = {}
spotify_dataset_path = 'data/spotify_million_playlist_dataset/'
# Add all songs from each Spotify slice file to the all_songs dictionary (dumped to all_songs.json below)
def add_all_songs_from_file(path):
    with open(path) as f:
        data = json.load(f)
    for playlist in data['playlists']:
        for track in playlist['tracks']:
            track_id = track['track_uri'].partition('spotify:track:')[-1]
            artist_id = track['artist_uri'].partition('spotify:artist:')[-1]
            if track_id not in all_songs:
                all_songs[track_id] = {
                    'track_name': track['track_name'],
                    'artist_name': track['artist_name'],
                    'artist_id': artist_id
                }

for slice_file in os.listdir(spotify_dataset_path):
    add_all_songs_from_file(spotify_dataset_path + slice_file)

with open('data/all_songs.json', 'w') as f:
    json.dump(all_songs, f)
import pandas as pd
# Demonstration of the file's format
all_songs_df = pd.read_json('data/all_songs.json').T
all_songs_df.head()
| | track_name | artist_name | artist_id |
|---|---|---|---|
| 0UaMYEvWZi0ZqiDOoHU3YI | Lose Control (feat. Ciara & Fat Man Scoop) | Missy Elliott | 2wIVse2owClT7go1WT98tk |
| 6I9VzXrHxO9rA9A5euc8Ak | Toxic | Britney Spears | 26dSoYclwsYLMAKD3tpOr4 |
| 0WqIKmW4BTrj3eJFmnCKMv | Crazy In Love | Beyoncé | 6vWDO969PvNqNYHIOW5v0m |
| 1AWQoqb9bSvzTjaLralEkT | Rock Your Body | Justin Timberlake | 31TPClRtHm23RisEBtV3X7 |
| 1lzr43nnXAijIGYnCT8M8H | It Wasn't Me | Shaggy | 5EvFsr3kj42KNv97ZEnqij |
# Sample of the NRC emotion-intensity lexicon format (word <TAB> intensity)
with open('data/NRC/OneFilePerEmotion/sadness-scores.txt', 'r') as lexicon_file:
    for _ in range(3):
        print(lexicon_file.readline())
heartbreaking	0.969

mourning	0.969

tragic	0.961
The lyrics scraping consists of two steps.
The first step is building a list of all songs' assumed URLs. We derive each URL from the track and artist names (Genius generally uses a fixed URL layout for songs: '...genius.com/{artist_name}-{track_name}-lyrics').
import asyncio
import json
import unicodedata
import re
from time import sleep

asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())
all_urls = []
all_require_search = []
all_lyrics = {}
# Build Genius URLs list
def parse_name(name):
    s = unicodedata.normalize('NFKD', name).encode('ascii', 'ignore').decode('utf8')
    s = re.search(r'([^()\[\]-]*)', s).group(1).strip().replace(' ', '-').replace('&', 'and')
    return re.sub(r'[^a-zA-Z0-9_\-]', '', s)
with open('data/songs_dataset.json', 'r') as songs_file, \
     open('data/lyrics1-100000.json', 'r') as lyrics_file:
    all_songs = json.load(songs_file)
    all_lyrics = json.load(lyrics_file)
assert type(all_lyrics) == dict
assert type(all_songs) == dict

counter = 0
for track_id, track_data in all_songs.items():
    # Limit the number of songs
    if counter >= 100000:
        break
    counter += 1
    # Don't fetch lyrics we already have
    if all_lyrics.get(track_id):
        continue
    parsed_track_name = parse_name(track_data['track_name'])
    parsed_artist_name = parse_name(track_data['artist_name'])
    if parsed_artist_name and parsed_track_name:
        all_urls.append((track_id, track_data,
                         f'https://genius.com/{parsed_artist_name}-{parsed_track_name}-lyrics'))
    else:
        # If we could not identify the artist/track name (missing or non-ASCII), we'll have to search for it
        all_require_search.append((track_id, track_data))
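As a sanity check, the URL-building logic can be exercised on the example track shown earlier (parse_name is reproduced here so the snippet is self-contained):

```python
import re
import unicodedata

def parse_name(name):
    # Same normalization as above: drop accents, cut at the first bracket or dash,
    # hyphenate spaces, and strip any remaining special characters
    s = unicodedata.normalize('NFKD', name).encode('ascii', 'ignore').decode('utf8')
    s = re.search(r'([^()\[\]-]*)', s).group(1).strip().replace(' ', '-').replace('&', 'and')
    return re.sub(r'[^a-zA-Z0-9_\-]', '', s)

artist = parse_name('Missy Elliott')
track = parse_name('Lose Control (feat. Ciara & Fat Man Scoop)')
url = f'https://genius.com/{artist}-{track}-lyrics'
# url == 'https://genius.com/Missy-Elliott-Lose-Control-lyrics'
```

Note how the "(feat. ...)" suffix is cut off - the code assumes Genius URLs omit featured artists.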
The second step is fetching the lyrics from the assumed URLs, using parallel HTTP requests for better runtime (in batches of 200 songs, to avoid overloading the server). The remaining songs, for which we could not fetch lyrics with the assumed URLs, are searched via the Genius API (see the full code in lyrics_list_builder.py).
import aiohttp
import asyncio
from lyrics_scraper import url, lyrics # based on code from https://github.com/johnwmillr/LyricsGenius
async def get_lyrics(session, url, track_id, track_name):
    try:
        async with session.get(url, timeout=5) as resp:
            if resp.status == 200:
                lyrics_html = await resp.text()
                return (track_id, track_name + '\n' + lyrics(lyrics_html, True))
            if resp.status != 404:
                print(f'Received status {resp.status} for {url}')
            return (track_id, None)
    except Exception:
        return (track_id, None)
songs_lyrics_list = []

async def add_to_lyrics_list(urls_list, songs_offsets=(0, None)):
    """Try to retrieve lyrics from the given URLs"""
    global songs_lyrics_list
    async with aiohttp.ClientSession() as session:
        tasks = []
        for track_id, track_data, url in urls_list[songs_offsets[0]:songs_offsets[1]]:
            tasks.append(asyncio.ensure_future(get_lyrics(session, url, track_id, track_data['track_name'])))
        songs_lyrics_list += await asyncio.gather(*tasks)
# Build lyrics list with asynchronous HTTP requests to genius.com
total_songs_num = len(all_urls)
songs_at_each_interval = 200
for i in range(0, total_songs_num, songs_at_each_interval):
    asyncio.run(add_to_lyrics_list(all_urls, (i, i + songs_at_each_interval)))
    sleep(0.2)

for track_id, track_lyrics in songs_lyrics_list:
    if not track_lyrics:
        all_require_search.append((track_id, all_songs[track_id]))
songs_lyrics_list = [(track_id, lyrics) for (track_id, lyrics) in songs_lyrics_list if lyrics]
# Add all lyrics to the lyrics_corpus.json file
file_path = 'data/lyrics_corpus.json'
with open(file_path, 'w') as f:
    all_lyrics.update({track_id: lyrics for (track_id, lyrics) in songs_lyrics_list if lyrics})
    json.dump(all_lyrics, f)
len(all_lyrics)
78130
Before we start analyzing the lyrics, we need to remove outliers. For our purpose, outliers are songs whose lyrics are too short (which probably indicates that they were not properly scraped, or that they are purely instrumental) or too long (which might bias the analysis).
import plotly.express as px
import pandas as pd
import plotly.offline as pyo
# Set notebook mode to work in offline for plotly
pyo.init_notebook_mode()
all_lyrics_series = pd.Series(all_lyrics)
fig = px.histogram([len(x) for x in all_lyrics_series.values], labels={'value':'lyrics length'})
fig.show()
We see that some songs have empty (or nearly empty) lyrics - most probably instrumental pieces with just a title or only a few words. A few songs are so much longer than the average that analyzing them might give results incomparable with the rest.
We'll get rid of those outliers with the z-score measurement.
from scipy.stats import zscore
import numpy as np
abs_z_scores = np.abs(zscore([len(x) for x in all_lyrics_series]))
# For our purpose, an outlier is a value that is more than 3 standard deviations from the mean
filtered_z_entries = (abs_z_scores < 3)
filtered_min_entries = [len(x) > 200 for x in all_lyrics_series]
all_lyrics_series = all_lyrics_series[filtered_z_entries & filtered_min_entries]
fig = px.histogram([len(x) for x in all_lyrics_series.values], labels={'value':'lyrics length'})
print(len(all_lyrics_series))
fig.show()
74603
Looks much more consistent. We'll continue only with these lyrics.
all_lyrics_series, which we will use in our analysis, only contains entries of songs with existing, obtainable and standard lyrics. Our main goal is to create clusters of similar songs that will assist us in recommending songs to users (or, for our task, predicting the missing songs in each playlist).
As explained above, our emphasis is on content-based filtering - we will analyze both the audio and the lyrics of all songs. Clustering will be done with each of them individually and then with both together, in order to compare the approaches and test which works best.
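As a preview of that clustering step, here is a minimal stdlib k-means sketch on made-up 2-D 'emotion' vectors (the points, seeds and scores below are invented for illustration; the real clustering will run on our full feature set):

```python
import math

def kmeans(points, k, init_centroids, iters=20):
    """Plain k-means; fixed initial centroids keep the run deterministic."""
    centroids = [list(c) for c in init_centroids]
    for _ in range(iters):
        # Assign every point to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster
        for i, cluster in enumerate(clusters):
            if cluster:
                centroids[i] = [sum(dim) / len(cluster) for dim in zip(*cluster)]
    return clusters, centroids

# Made-up (joy, sadness) scores for six songs - two obvious groups
songs = [(0.9, 0.1), (0.8, 0.2), (0.85, 0.15), (0.1, 0.9), (0.2, 0.8), (0.15, 0.85)]
clusters, centroids = kmeans(songs, 2, init_centroids=[(1.0, 0.0), (0.0, 1.0)])
```

With the real data, each point would be a song's eight-dimensional emotion vector, optionally concatenated with its audio features.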
Music is (usually) not just a melody, but also lyrics that tell us a story, a feeling, an idea. Lyrics can make people connect with a song, identify with it and fall in love with it. For some people (or in some genres) lyrics aren't that important and only the sound matters; for others, lyrics are everything. In order to recommend songs to people, we should at least try to understand what those songs are saying.
Analyzing lyrics is no simple task. Unlike audio features, which are fetched as numeric values ready to be processed, lyrics are raw texts filled with less-important words, slang, punctuation, abbreviations, metaphors, irony and plenty of sarcasm. Some of these problems can be overcome; others, not so much. Therefore, we do not aim to understand each song's entire implicit meaning and hidden themes. We only try to get a clue of what each song is about, and of which songs are similar in that respect.
We will need to find ways to turn these texts into normalized numeric features that can be compared between different songs and clustered.
import nltk
import matplotlib.pyplot as plt
from wordcloud import WordCloud
# Display a Word Cloud of the most common words in lyrics
all_lyrics_raw = ' '.join(all_lyrics_series.values)
wordcloud = WordCloud(max_font_size=60, background_color = 'white').generate(all_lyrics_raw)
plt.imshow(wordcloud,interpolation="bilinear")
plt.axis("off")
plt.show()
As seen in this simple word cloud, lyrics are generally very diverse in their content, tone and meaning. On the other hand, some words are extremely common and carry very little useful information to us.
We need to find methods to extract "meaning" from the songs. We do not aim to understand each song (that would be impossible anyway), but to produce features from each of them - numeric features, just like the audio ones - in order to group them into clusters.
We will focus on the following methods of lyrics analysis:
Frequent words
Emotions/Sentiments - Emotional features
Document Clustering using Topic Model
Just as in any type of art, music is driven by emotions. Music allows writers, composers and producers to express their emotions through melodies, verses and choruses. It is also what makes listeners like/dislike certain music and feel attached to/detached from it - which is what interests us in this project.
In some songs, lyrics may give us a better clue about their tone and expressed feelings than their melodies, so we theorize that emotion recognition of lyrics can help us predict which songs each user likes.
We could not find any dataset of songs with their emotional values, and we are not planning to construct one manually ourselves. Therefore, we'll need to find a dataset that can help us score our songs.
Several datasets were considered for this part but dismissed: Emotions dataset for NLP (from Kaggle), Text Emotion by CrowdFlower, ISEAR, Emotion Intensity in Tweets from WASSA 2017, and GoEmotions. These datasets all rely on sentence structure, which is inconvenient for lyrics analysis (it is very hard, and sometimes impossible, to break a song into logical, correct sentences); they only give a flat label per example, without a numeric intensity value; and some of them are based on social-media slang, style and symbols, which is usually very different from the prosaic style of songs.
Instead, we will use a rule-based sentiment analysis approach - scoring lyrics on a word basis, using a given dictionary/lexicon and language rules.
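In its simplest form, the word-level idea looks like this (toy lexicon with made-up words and intensities, not the NRC values):

```python
# Made-up emotion lexicon: word -> intensity in [0, 1]
toy_lexicon = {'happy': 0.9, 'smile': 0.6, 'sun': 0.3}

def score_text(text, lexicon):
    """Average the intensities of lexicon words over all words in the text."""
    words = text.lower().split()
    hits = [lexicon[w] for w in words if w in lexicon]
    return sum(hits) / len(words) if words else 0.0

score_text('the sun makes me happy', toy_lexicon)  # (0.3 + 0.9) / 5 = 0.24
```

The rule-based part refines this by also inspecting each scored word's context, as we will see with VADER.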
The dataset we found most suitable for our purpose was the NRC Word-Emotion Association Lexicon suite, and specifically its Emotion Intensity Lexicon - a dictionary of words with numeric intensity values per emotion.
In the Emotion Recognition part, we give each song eight intensity scores, one for each of the following basic emotions found in our emotion lexicon dataset:
These scores are computed using VADER (Valence Aware Dictionary and sEntiment Reasoner) - a lexicon- and rule-based sentiment analysis tool. Its main advantages are its customizability, its efficiency, and its ability to detect negators and boosters/reducers that appear before words. But we'll get to all of that later...
First, let's load all the lexicons:
# Load the required datasets
lexicons_by_emotion_path = 'data/NRC/OneFilePerEmotion/{}-scores.txt'
emotions = ['anger', 'anticipation', 'disgust', 'fear', 'joy', 'sadness', 'surprise', 'trust']
emotions_lexicons = {emot: dict() for emot in emotions}
def load_emotion_lex(emotion):
    with open(lexicons_by_emotion_path.format(emotion), 'r') as lexicon_file:
        for line in lexicon_file:
            word, intensity = line.split()
            emotions_lexicons[emotion][word] = float(intensity)

def dump_emotion_lex(emotion):
    with open(lexicons_by_emotion_path.format(emotion), 'w') as lexicon_file:
        lexicon_file.writelines([f'{word}\t{intensity}\n' for word, intensity in emotions_lexicons[emotion].items()])

for emotion in emotions:
    load_emotion_lex(emotion)
For example: the first (and strongest) entries of the 'joy' lexicon look like:
list(emotions_lexicons['joy'].items())[:8]
[('happiest', 0.986),
('happiness', 0.984),
('bliss', 0.971),
('celebrating', 0.97),
('jubilant', 0.969),
('ecstatic', 0.954),
('elation', 0.944),
('beaming', 0.938)]
We want to make sure the emotion lexicons match our lyrics data and do not miss any important words. We can't possibly cover every single word that appears in any song, but we'll handle the most common ones.
We will use a RegexpTokenizer to split the lyrics into words, and dismiss non-relevant words (stopwords).
import re
from nltk.tokenize import RegexpTokenizer
def tokenize_lyrics(text, min_word_len):
    # Keep only alphabetic words of at least min_word_len characters (filters out numbers and very short words)
    tokenizer = RegexpTokenizer(r"[a-zA-Z’‘']{{{},}}".format(min_word_len))
    # Replace the popular "-n'" and "'cause" abbreviations with "-ng" and "because" respectively, for better word recognition
    return [re.sub(r"n'$", "ng", re.sub(r"^'cause$|^cuz$", "because", re.sub(r'[‘’]', "'", token)))
            for token in tokenizer.tokenize(text.lower())]
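For instance, an equivalent stdlib-only version of this tokenizer (re.findall behaves like RegexpTokenizer here) turns a lyric line into normalized tokens:

```python
import re

def tokenize_lyrics(text, min_word_len):
    # Stdlib re-implementation of the RegexpTokenizer cell above
    pattern = r"[a-zA-Z’‘']{%d,}" % min_word_len
    return [re.sub(r"n'$", "ng", re.sub(r"^'cause$|^cuz$", "because", re.sub(r'[‘’]', "'", token)))
            for token in re.findall(pattern, text.lower())]

tokenize_lyrics("Cryin' 'cause I'm lovin' you", 3)
# ['crying', 'because', "i'm", 'loving', 'you']
```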
import nltk
import spacy.lang.en # found to be a more extensive stopwords collection than what nltk has
import spacy.lang.fr
import spacy.lang.es
# We add some more special stopwords, common in songs
lyrics_stop_words = ['ooh', 'yeah', 'yes', 'doo', 'hey', 'whoa', 'woah', 'ohh', 'oooh', 'yah', 'yeh', 'mmm', 'hmm', 'deh', 'doh', 'jah', 'nah', 'til', 'till']
STOP_WORDS = spacy.lang.en.STOP_WORDS.union(spacy.lang.fr.STOP_WORDS, spacy.lang.es.STOP_WORDS, lyrics_stop_words)
# Build FreqDist for all words found in lyrics, excluding stopwords
lyrics_freq_dist = nltk.FreqDist(w for w in tokenize_lyrics(all_lyrics_raw, 3) if w not in STOP_WORDS)
Out of the 500 most common words in the lyrics, we will build a list of all presumably sentimental words which are not present in the emotion lexicons. Since most words are not sentimental, we will only add words that are also present in VADER's lexicon, which contains about 7,000 sentiment-bearing words.
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
VADER_SCALE = 4
def_vader_analyzer = SentimentIntensityAnalyzer()
nrc_path = 'data/NRC/NRC-Emotion-Intensity-Lexicon-v1.txt'
with open(nrc_path, 'r') as lexicon_file:
    lexicon = {line.partition('\t')[0] for line in lexicon_file}

# Find all sentimental words that are very common in lyrics (top 500) but missing in our lexicon
missing_words = []
for idx, w in enumerate(lyrics_freq_dist.most_common(500)):
    # We assume that if a word is sentimental, it appears in VADER's default lexicon
    sentiment = def_vader_analyzer.lexicon.get(w[0])
    if sentiment and w[0] not in lexicon:
        missing_words.append((w[0], idx, sentiment))
print(f'There are {len(missing_words)} missing words:\nFORMAT:(<word>, <freq>, <sentiment>)')
print(missing_words)
There are 47 missing words:
FORMAT:(<word>, <freq>, <sentiment>)
[('want', 10, 0.3), ('fuck', 46, -2.5), ('better', 51, 1.9), ('niggas', 63, -1.4), ('hard', 75, -0.4), ('play', 100, 1.4), ('wrong', 106, -2.1), ('hand', 119, 2.2), ('alright', 131, 1.0), ('best', 132, 3.2), ('care', 135, 2.2), ('wish', 150, 1.7), ('ass', 162, -2.5), ('low', 184, -1.1), ('shake', 187, -0.7), ('forget', 192, -0.9), ('sure', 221, 1.3), ('amor', 225, 3.0), ('dead', 227, -3.3), ('drop', 228, -1.1), ('fine', 231, 0.8), ('bitches', 240, -2.9), ('easy', 261, 1.9), ('okay', 270, 0.9), ('cool', 291, 1.3), ('straight', 306, 0.9), ('playing', 324, 0.8), ('cut', 329, -1.1), ('sorry', 343, -0.3), ('woo', 344, 2.1), ('tears', 349, -0.9), ('tired', 355, -1.9), ('lies', 362, -1.8), ('worth', 399, 0.9), ('great', 402, 3.1), ('bright', 420, 1.9), ('nice', 427, 1.8), ('dick', 442, -2.3), ('rich', 458, 2.6), ('loves', 462, 2.7), ('clear', 472, 1.6), ('stuck', 473, -1.0), ('number', 477, 0.3), ('thank', 487, 1.5), ('trouble', 488, -1.7), ('fucked', 492, -3.4), ('reach', 498, 0.1)]
Not too bad. We'll manually add the important ones to the lexicon files:
# Add missing words to their respective emotion group
missing_emotions_dict = {
# If '0' intensity value is given, it is automatically calculated from VADER lexicon (/4)
# If '<word>' intensity value is given, it is given the same intensity as in <word>
'joy': {'better': 0, 'best': 0, 'loves': 'love', 'alright': 0, 'care': 0.75,
'shake': 0.5, 'fine': 0, 'okay': 0, 'easy': 0.125, 'cool': 0, 'great': 0, 'nice': 0, 'bright': 0, 'won': 0, 'rich': 0.4, 'thank': 0},
'anger': {'fuck': 0, 'wrong': 0, 'hard': 0, 'ass': 0, 'forget': 0, 'bitches': 0, 'dick': 0, 'dead': 'death',
'cut': 0, 'fucked': 0, 'lies': 'lie', 'stuck': 0, 'trouble': 0.1},
'sadness': {'hard': 0, 'low': 0, 'forget': 0, 'dead': 'death', 'heartbroken': 0.781, 'sorry': 0, 'tired': 0.25, 'tears': 0.6,
'lies': 'lie', 'stuck': 0.5, 'trouble': 0.15},
'disgust': {'ass': 0.7, 'bitches': 0, 'dead': 'death', 'cut': 0,
'fucked': 0, 'lies': 'lie'},
'fear': {'hard': 0, 'forget': 0, 'dead': 'death', 'cut': 0.1, 'stuck': 0.55, 'fucked': 0.2, 'trouble': 0},
'surprise': {'better': 0, 'dead': 'death', 'wrong': 0, 'easy': 0, 'hard': 0, 'great': 0, 'nice': 0,
'won': 0},
'trust': {'loves': 'love', 'care': 0, 'sure': 0, 'fine': 0, 'thank': 0, 'believe': 0.7},
'anticipation': {'alright': 0, 'dead': 'death', 'cool': 0, 'great': 0, 'bright': 0, 'rich': 0}
}
for emotion, new_words in missing_emotions_dict.items():
    load_emotion_lex(emotion)
    for word, intensity in new_words.items():
        if type(intensity) == str:
            intensity = emotions_lexicons[emotion][intensity]
        elif intensity == 0:
            intensity = def_vader_analyzer.lexicon.get(word) / VADER_SCALE
        emotions_lexicons[emotion][word] = abs(float(intensity))
    dump_emotion_lex(emotion)
As stated before, the VADER module contains a default lexicon of words and their sentiments, with which it analyzes texts and gives them positive, negative and neutral intensity scores on a scale of 0.0-1.0. However, this lexicon only has binary sentiments - positive and negative (neutral meaning neither) - otherwise we could have just used plain VADER.
Instead, we will use VADER with our emotions lexicons in the following way:
Since VADER is a generic model, by replacing its lexicon with our emotion lexicons we can (hopefully) accurately identify a song's general emotions. Emotion intensities are determined not only by the presence and frequency of words from the corresponding lexicon in the text, but also by the presence of booster/negation words in their context (such as very, extremely, barely, not, etc.).
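The booster/negator mechanism can be illustrated with a heavily simplified, stand-alone sketch (real VADER looks several words back, dampens boosts and applies many more rules; the lexicon below is made up):

```python
LEXICON = {'happy': 0.9, 'sad': 0.8}       # made-up intensities
BOOSTERS = {'very': 1.3, 'barely': 0.7}    # scale the following word's score
NEGATORS = {'not', 'never'}                # flip the following word's score

def score(text):
    total = 0.0
    words = text.lower().split()
    for i, w in enumerate(words):
        if w not in LEXICON:
            continue
        s = LEXICON[w]
        if i > 0 and words[i - 1] in BOOSTERS:  # e.g. "very happy" scores higher
            s *= BOOSTERS[words[i - 1]]
        if i > 0 and words[i - 1] in NEGATORS:  # e.g. "not happy" flips the sign
            s = -s
        total += s
    return total
```

So a song repeating "never happy" would contribute negative joy, which plain bag-of-words counting over the lexicon would get wrong.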
Let's copy our lexicons to the VADER's directory and create analyzers out of them.
import shutil
import os
import vaderSentiment
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
vader_dir = os.path.dirname(vaderSentiment.__file__)
emotions_vader_analyzers = {}
# Copy all lexicons to the VADER module's directory
for emot in emotions:
    load_emotion_lex(emot)
    # Intensities of words in our lexicons are in the range 0.0-1.0; VADER values are in the range 0.0-4.0,
    # so we multiply every intensity value by 4 before copying the lexicons
    emotions_lexicons[emot].update((x, y * VADER_SCALE) for x, y in emotions_lexicons[emot].items())
    if emot != 'sadness' and emot != 'joy':
        dump_emotion_lex(emot)
        emotion_lexicon_filename = os.path.basename(lexicons_by_emotion_path.format(emot))
        shutil.copyfile(lexicons_by_emotion_path.format(emot), os.path.join(vader_dir, emotion_lexicon_filename))
        emotions_vader_analyzers[emot] = SentimentIntensityAnalyzer(emotion_lexicon_filename)

# Create a combined lexicon of joy and sadness
# by adding sadness words with (-1)*intensity
for word, intensity in emotions_lexicons['sadness'].items():
    if word not in emotions_lexicons['joy'] or emotions_lexicons['joy'][word] < intensity:
        emotions_lexicons['joy'][word] = -intensity
dump_emotion_lex('joy')
emotion_lexicon_filename = os.path.basename(lexicons_by_emotion_path.format('joy'))
shutil.copyfile(lexicons_by_emotion_path.format('joy'), os.path.join(vader_dir, emotion_lexicon_filename))
emotions_vader_analyzers['joy'] = SentimentIntensityAnalyzer(emotion_lexicon_filename)
Let's take a look at two songs for example - one very joyful, and the other very sad, according to one of our analyzers.
Note: Whenever we analyze a song, we first remove all stopwords, EXCEPT those that are used by VADER for the analysis (negators and boosters).
Note: Full lemmatization and stemming before analyzing the songs are not needed and should not be done, since most word forms are already in the lexicon, with a distinct value for each form. We only break up contractions like "we're" and "i'm" with TreeTagger, in order to identify them as stopwords.
import spacy.lang.en
import treetaggerwrapper as ttpw
from vaderSentiment.vaderSentiment import negated, BOOSTER_DICT
from random import sample
from nltk.tokenize import RegexpTokenizer
import re

tagger = ttpw.TreeTagger(TAGLANG='en')
unnecessary_stopwords = {word for word in spacy.lang.en.STOP_WORDS.union(lyrics_stop_words) if
                         word != 'no' and not negated([word]) and word not in BOOSTER_DICT.keys() and word not in lexicon}
def analyze_song(lyrics, emotion):
    # The sadness value is the negative ('neg') component of the 'joy' analysis
    if emotion == 'sadness':
        emotion, sentiment = 'joy', 'neg'
    else:
        sentiment = 'pos'
    # Keep all tokens of 3+ characters, plus 'no', which is a negator
    tokenized = [w for w in tokenize_lyrics(lyrics, 2) if len(w) >= 3 or w == 'no']
    return emotions_vader_analyzers[emotion].polarity_scores(' '.join(
        w for w in [t.split('\t')[-1] for t in tagger.tag_text(tokenized, prepronly=True)] if w not in unnecessary_stopwords))[sentiment]
def find_max_song(N, emotion):
    max_song = ('', 0.0)
    lyrics_list = sample([x for x in all_lyrics_series if len(x) > 600], N)
    for lyrics in lyrics_list:
        val = analyze_song(lyrics, emotion)
        if val > max_song[1]:
            max_song = (lyrics, val)
    # Print all analyzers' results for the winning song
    for emotion, analyzer in emotions_vader_analyzers.items():
        print(f"{emotion}: {analyzer.polarity_scores(' '.join(w for w in tokenize_lyrics(max_song[0], 2) if w not in unnecessary_stopwords))}")
    return max_song[0]
# Find most joyful song in randomly selected 500 songs
find_max_song(500, 'joy')
anger: {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
anticipation: {'neg': 0.0, 'neu': 0.294, 'pos': 0.706, 'compound': 0.9989}
disgust: {'neg': 0.0, 'neu': 0.969, 'pos': 0.031, 'compound': 0.1749}
fear: {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
surprise: {'neg': 0.0, 'neu': 0.857, 'pos': 0.143, 'compound': 0.89}
trust: {'neg': 0.0, 'neu': 0.61, 'pos': 0.39, 'compound': 0.993}
joy: {'neg': 0.0, 'neu': 0.172, 'pos': 0.828, 'compound': 0.9996}
"Ffun\nOh, babe, yeah, yeah\nWell, well, well, well, well\nI wanna thank you\nFor your love so true\nYour tender lovin'\nKeeps my mind on you\nI'll pick you up\nAnd we'll go out and dance\nYou kiss my lips\nAnd we'll have fun, fun, fun\nMake it or break it\nGot to make you mine\nKissin' and huggin'\nAll the time\nYou are my one\nAnd only under the sun\nGive me your hand\nAnd we'll have fun, fun, fun\nOh, we're havin' fun\nI wanna thank you\nFor your love so true\nYour tender lovin'\nKeeps my mind on you\nI'll pick you up\nAnd we'll go out and dance\nYou kiss my lips\nAnd we'll have fun, fun, fun\nMake it or break it\nGot to make you mine\nKissin' and huggin'\nAll the time\nYou are my one\nAnd only under the sun\nGive me your hand\nAnd we'll have fun, fun, fun\nFun, fun, fun\nFun, fun, fun\nFun, fun, fun\nFun, fun, fun\nFun, fun, fun\nFun, fun, fun"
# Find saddest song in randomly selected 500 songs
find_max_song(500, 'sadness')
anger: {'neg': 0.0, 'neu': 0.588, 'pos': 0.412, 'compound': 0.9933}
anticipation: {'neg': 0.0, 'neu': 0.832, 'pos': 0.168, 'compound': 0.9423}
disgust: {'neg': 0.0, 'neu': 0.708, 'pos': 0.292, 'compound': 0.9828}
fear: {'neg': 0.0, 'neu': 0.433, 'pos': 0.567, 'compound': 0.9979}
surprise: {'neg': 0.0, 'neu': 0.967, 'pos': 0.033, 'compound': 0.4299}
trust: {'neg': 0.0, 'neu': 0.792, 'pos': 0.208, 'compound': 0.9709}
joy: {'neg': 0.551, 'neu': 0.351, 'pos': 0.098, 'compound': -0.9974}
"If We Only\nLet it die!\n\nSick of all the promises you've broken\n(You break them all)\nI get all the blame and you're the victim\n(It's all a lie)\nHow long will you hold on?\nWhy can't I let go?\nFind a way to escape all the hate\n\nIf we only could get over all the pain inside, all the wasted pride\nIf we only, if we only could just face the truth\nGive up the fight and let it die\n\nTell me why you drown me with your silence\n(You make me sick)\nTell me why I'm always craving violence\n(Tell me why)\nHow long will you hold on?\nWhy can't I let it go?\nFind a way to escape all the hate\n\nIf we only could get over all the pain inside, all the wasted pride\nIf we only, if we only could just face the truth\nGive up the fight and let it die\nLet it die!\nI'm done with this, I'm done with this, I'm done with this\n\nIf we only could get over all the tears we've cried, all the wasted pride\nIf we only, if we only could just face the truth\nGive up the fight and let it die\nIf we only could get over all the tears we've cried, all the wasted pride\nIf we only, if we only could just face the truth\nGive up the fight and let it die\n\nLet it die!\nLet it die!"
Before we analyze all songs, we'll filter out the non-English ones.
Note: The lexicons and stopwords we have been using are all suited for English, but versions of them exist for many other languages. The VADER algorithm natively supports only English but can be adapted to other languages too. For simplicity, we analyze only English lyrics as a demonstration.
from langdetect import detect_langs
all_lyrics_lang = pd.Series({tr_id: detect_langs(lyric)[0] for tr_id, lyric in all_lyrics_series.items()})
all_lyrics_series = all_lyrics_series[[l.lang == 'en' and l.prob >=0.95 for l in all_lyrics_lang]]
all_lyrics_series.to_pickle('all_lyrics_series.pkl')
We can now analyze and score all songs.
import pandas as pd
all_songs_emotions = pd.DataFrame({emotion: [analyze_song(lyrics, emotion) for lyrics in all_lyrics_series]
for emotion in emotions},
index=all_lyrics_series.keys())
all_songs_emotions.to_csv('data/songs_emotions.csv')
all_songs_emotions.head()
| | anger | anticipation | disgust | fear | joy | sadness | surprise | trust |
|---|---|---|---|---|---|---|---|---|
| 0UaMYEvWZi0ZqiDOoHU3YI | 0.141 | 0.044 | 0.066 | 0.117 | 0.126 | 0.107 | 0.098 | 0.079 |
| 6I9VzXrHxO9rA9A5euc8Ak | 0.144 | 0.088 | 0.262 | 0.224 | 0.260 | 0.175 | 0.000 | 0.100 |
| 0WqIKmW4BTrj3eJFmnCKMv | 0.259 | 0.101 | 0.049 | 0.271 | 0.284 | 0.223 | 0.056 | 0.254 |
| 1AWQoqb9bSvzTjaLralEkT | 0.063 | 0.097 | 0.082 | 0.064 | 0.280 | 0.000 | 0.068 | 0.217 |
| 1lzr43nnXAijIGYnCT8M8H | 0.035 | 0.111 | 0.006 | 0.082 | 0.075 | 0.071 | 0.031 | 0.078 |
We can also visualize the distribution of each emotion and see the differences between them:
import pandas as pd
all_songs_emotions = pd.read_csv('data/songs_emotions.csv', index_col=0)
pd.DataFrame(all_songs_emotions).hist(layout=(4,2), figsize=[12, 12], bins=50)
array([[<AxesSubplot:title={'center':'anger'}>,
<AxesSubplot:title={'center':'anticipation'}>],
[<AxesSubplot:title={'center':'disgust'}>,
<AxesSubplot:title={'center':'fear'}>],
[<AxesSubplot:title={'center':'joy'}>,
<AxesSubplot:title={'center':'sadness'}>],
[<AxesSubplot:title={'center':'surprise'}>,
<AxesSubplot:title={'center':'trust'}>]], dtype=object)
Let's visualize all features, one pair at a time, to find out how much (if any) they are correlated:
import seaborn as sns
from scipy.stats import zscore
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
def hide_current_axis(*args, **kwargs):
    plt.gca().set_visible(False)

def jointplot_hex(x, y, color, **kwargs):
    cmap = sns.light_palette(color, as_cmap=True)
    xy_df = pd.concat([x, y], axis=1)
    # Ignore near-zero cells, which carry no real emotion signal
    nonzero_cells_filter = (xy_df > 0.007)
    xy_df = pd.DataFrame(zscore(xy_df), columns=[x.name, y.name], index=x.index)
    # Keep only rows where both features are non-zero and within 3.5 standard deviations
    outliers_filter = nonzero_cells_filter & (np.abs(xy_df) < 3.5)
    xy_df = xy_df[outliers_filter.iloc[:, 0] & outliers_filter.iloc[:, 1]]
    pearson = stats.pearsonr(xy_df.iloc[:, 0], xy_df.iloc[:, 1])[0]
    plt.hexbin(xy_df.iloc[:, 0], xy_df.iloc[:, 1], cmap=cmap, gridsize=30, **kwargs)
    plt.annotate(f'$pearson = {pearson:.3f}$',
                 xy=(0.1, 0.9), xycoords='axes fraction',
                 ha='left', va='center',
                 bbox={'boxstyle': 'round', 'fc': 'powderblue', 'ec': 'navy'})

def filtered_hist(x, **kwargs):
    nonzero_cells_filter = (x > 0.007)
    x = pd.Series(zscore(x), name=x.name, index=x.index)
    outliers_filter = nonzero_cells_filter & (np.abs(x) < 3.5)
    x = x[outliers_filter]
    plt.hist(x=x, bins=50, **kwargs)
# For our purpose, an outlier is a value that is more than 3.5 standard deviations from the mean
g = sns.PairGrid(data=all_songs_emotions, corner=True)
g.map_lower(jointplot_hex)
g.map_upper(hide_current_axis)
g.map_diag(filtered_hist, color='cornflowerblue')
<seaborn.axisgrid.PairGrid at 0xa72b1130>
We can see a moderate correlation in two pairs of features.
We may consider omitting one feature from each such pair as a form of dimensionality reduction if needed.
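If we do drop one feature out of each correlated pair, a minimal pandas sketch of such correlation-based pruning could look like this (the column names, the synthetic data, and the 0.9 threshold are illustrative only, not the values used in our analysis):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(28)
# Toy emotion scores; 'trust' is deliberately made highly correlated with 'joy'
df = pd.DataFrame({'joy': rng.random(500)})
df['trust'] = df['joy'] * 0.8 + rng.random(500) * 0.2
df['anger'] = rng.random(500)

corr = df.corr().abs()
# Look only at the upper triangle, so each pair is considered once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
# Drop one column from every pair whose |pearson| exceeds the threshold
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
print(to_drop)  # ['trust']
```

The same pattern scales to any number of feature columns; only one member of each highly correlated pair survives.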
As large and exhaustive as our lexicons are, the entire English vocabulary, or even just its relatively common, generally used part, is much larger. Many emotional words will not be detected when using the lexicons alone. Other words may have multiple meanings and be misinterpreted, and some are written differently from their dictionary form (like cryin' instead of crying).
Moreover, the VADER analyzer only scores unigrams and some bigrams: single sentimental words, and occasionally the negations/boosters that accompany them and increase/reduce/negate their intensity. It may completely ignore phrases and other meaningful word combinations.
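To make this mechanism concrete, here is a toy unigram scorer with booster/negation handling in the spirit of VADER (the lexicon valences, booster increments, and damping constant are all made up for illustration; they are not VADER's actual values):

```python
# Hypothetical valence lexicon and modifier lists (not VADER's real ones)
LEXICON = {'happy': 1.9, 'sad': -2.1}
BOOSTERS = {'very': 0.3, 'extremely': 0.4}
NEGATIONS = {'not', 'never'}

def toy_score(tokens):
    total = 0.0
    for i, tok in enumerate(tokens):
        if tok not in LEXICON:
            continue  # words outside the lexicon contribute nothing
        valence = LEXICON[tok]
        prev = tokens[i - 1] if i > 0 else None
        if prev in BOOSTERS:      # a booster intensifies the following word
            valence += BOOSTERS[prev] if valence > 0 else -BOOSTERS[prev]
        elif prev in NEGATIONS:   # a negation flips and damps the score
            valence *= -0.74
        total += valence
    return total

print(round(toy_score('extremely happy'.split()), 3))  # 2.3
print(round(toy_score('not happy'.split()), 3))        # -1.406
```

A phrase like "over the moon" scores zero here, since none of its words are in the lexicon — exactly the limitation described above.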
Perhaps we can combine deep learning with our analyzers in order to tackle these issues.
Besides emotions, songs' lyrics talk about plenty of subjects. These subjects can be related to sub-emotions, or they can be completely neutral. We want to find the most common and significant topics in our entire songs corpus, and find out which topic(s) each song belongs to.
A topic model is a generative model that intends to discover the underlying topics in a collection of documents and each document's assumed closeness to these topics. A popular and well-established topic modeling algorithm is Latent Dirichlet Allocation (LDA), a probabilistic generative model built on the assumption that every document in a corpus is a mixture of latent topics, and that each of these topics is itself a probability distribution over words.
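In standard notation, LDA's generative process for topic k, document d and word position n reads:

```latex
\begin{aligned}
\varphi_k &\sim \operatorname{Dirichlet}(\eta)  &&\text{word distribution of topic } k \\
\theta_d  &\sim \operatorname{Dirichlet}(\alpha) &&\text{topic mixture of document } d \\
z_{d,n}   &\sim \operatorname{Multinomial}(\theta_d) &&\text{topic assignment of the } n\text{-th word} \\
w_{d,n}   &\sim \operatorname{Multinomial}(\varphi_{z_{d,n}}) &&\text{the observed word}
\end{aligned}
```

The hyperparameters α and η reappear below when we tune the model.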
By using the topic model on our songs lyrics, we may be able to identify common topics/themes in our songs and cluster songs based on these topics.
We'll do some necessary preprocessing first.
import re
from nltk.tokenize import RegexpTokenizer

# Filter out numbers and words shorter than 3 characters
tokenizer = RegexpTokenizer(r'[a-zA-Z’‘\']{3,}')
# Normalize curly apostrophes, and expand the popular "n'" and "'cause"/"cuz"
# abbreviations to "ng" and "because" respectively, for better word recognition
all_lyrics_tokenized = [[re.sub(r"n'$", "ng", re.sub(r"^'cause$|^cuz$", "because", re.sub(r'[‘’]', "'", token)))
                         for token in tokenizer.tokenize(lyric.lower())]
                        for lyric in all_lyrics_series]
import treetaggerwrapper as ttpw
import spacy.lang.en
from vaderSentiment.vaderSentiment import NEGATE, BOOSTER_DICT
import pandas as pd
tagger = ttpw.TreeTagger(TAGLANG='en')
# This time we don't need negation and booster words
topics_stopwords = spacy.lang.en.STOP_WORDS.union(NEGATE, BOOSTER_DICT.keys(), lyrics_stop_words)
for idx, song in enumerate(all_lyrics_tokenized):
    # Lemmatize with TreeTagger (the lemma is the last tab-separated field) and drop stop words
    all_lyrics_tokenized[idx] = [token for token in [t.split('\t')[-1] for t in tagger.tag_text(song)]
                                 if token not in topics_stopwords and len(token) > 2]
pd.Series(all_lyrics_tokenized).to_pickle('all_lyrics_tokenized.pkl')
In order to perform Latent Dirichlet Allocation, we use the popular and well-established Python library gensim, which requires a dictionary representation of the documents. This means all tokens are mapped to a unique ID, which reduces the overall dimensionality of a literature corpus. In addition, we filter out tokens that occur in less than 60 songs, as well as tokens that occur in more than 80% of songs.
from gensim.corpora import Dictionary
dictionary = Dictionary(all_lyrics_tokenized)
dictionary.filter_extremes(no_below = 60, no_above = 0.8)
dictionary.save('./resources/lyrics_for_topics_dict.dct')
dictionary.save_as_text('./resources/lyrics_for_topics_dict.txt')
Each song (as of now a list of tokens) is converted into the bag-of-words format, which only stores the unique token ID and its count for each song.
gensim_corpus = [dictionary.doc2bow(song) for song in all_lyrics_tokenized]
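Conceptually, `doc2bow` just maps each token to its dictionary ID and counts occurrences; a minimal stand-in using only the standard library (with a hypothetical toy vocabulary in place of our `Dictionary`):

```python
from collections import Counter

# Toy token-to-ID mapping, standing in for gensim's Dictionary
token2id = {'love': 0, 'heart': 1, 'night': 2}
song = ['love', 'heart', 'love', 'night', 'love']

# Bag-of-words: sorted (token_id, count) pairs, like Dictionary.doc2bow returns
bow = sorted((token2id[t], c) for t, c in Counter(song).items() if t in token2id)
print(bow)  # [(0, 3), (1, 1), (2, 1)]
```

Word order is discarded entirely, which is exactly the bag-of-words assumption LDA relies on.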
Now our data is finally ready for topic extraction using LDA.
LDA is implemented in two major libraries: Scikit-learn and Gensim. We'll compare their results and runtime with default parameters (no optimization).
%%time
from gensim.models import LdaMulticore
initial_gensim_model = LdaMulticore(corpus=gensim_corpus, num_topics=6, id2word=dictionary, random_state=28)
Wall time: 38.1 s
# sklearn's LDA requires similar preprocessing to what we did earlier with gensim.
from sklearn.feature_extraction.text import CountVectorizer
# Our lyrics are already tokenized, so pass-through lambdas skip preprocessing and
# tokenization (note: token_pattern is ignored whenever a custom tokenizer is supplied)
tf_vectorizer = CountVectorizer(preprocessor=lambda doc: doc,
                                tokenizer=lambda doc: doc,
                                max_df=0.8,
                                min_df=60)
dtm_tf = tf_vectorizer.fit_transform(all_lyrics_tokenized)
%%time
from sklearn.decomposition import LatentDirichletAllocation
# Set same parameters as in Gensim's LDA
lda_tf = LatentDirichletAllocation(n_components=6, learning_method='online',
                                   learning_decay=initial_gensim_model.decay,
                                   n_jobs=-1,
                                   max_iter=initial_gensim_model.iterations,
                                   batch_size=initial_gensim_model.chunksize,
                                   learning_offset=initial_gensim_model.offset,
                                   evaluate_every=initial_gensim_model.eval_every,
                                   random_state=28)
lda_tf.fit(dtm_tf)
Wall time: 11min 5s
LatentDirichletAllocation(batch_size=2000, evaluate_every=10,
                          learning_decay=0.5, learning_method='online',
                          learning_offset=1.0, max_iter=50, n_components=6,
                          n_jobs=-1, random_state=28)
Gensim's Topics:
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis
vis_data = gensimvis.prepare(initial_gensim_model, gensim_corpus, dictionary)
pyLDAvis.save_html(vis_data, f'./Lyrics_initial_gensim.html')
pyLDAvis.display(vis_data)
Scikit-learn's Topics:
import pyLDAvis.sklearn as skvis
vis_data = skvis.prepare(lda_tf, dtm_tf, tf_vectorizer)
pyLDAvis.save_html(vis_data, f'./Lyrics_initial_sklearn.html')
pyLDAvis.display(vis_data)
We can see that the results are quite similar, but Gensim's runtime is much shorter.
We can summarize the results in the following table, matching similar topics between the two models according to the visualizations above:
| Topic Label(s) | Gensim Topic Num | Scikit-learn Topic Num |
|---|---|---|
| Religious, Spiritual | 5 | 4 |
| Explicit, Violent | 1 | 2 |
| Light Romance, Passion | 2 | 5 |
| Pain, Loss, Fear | 4 | absent |
| Nature | absent | 6 |
| Love | 6 | ~5,1 |
| Family | ~3 | 3 |
It looks like both models need optimization.
Since their results are comparable, we'll continue with Gensim, both because of its runtime and because it has a built-in evaluation method called Coherence (which we'll expand upon later on).
Another extra step needed to optimize the results is identifying the optimal number of topics (k), as well as the hyperparameters: the Document-Topic Density (α) and the Word-Topic Density (η).
The LDA model clusters our songs into topics, but it requires the number of topics, k, as a parameter. If k is too small, our topics will be too general, contain too many words, and the songs that share them will not actually be similar. However, a k that is too large will yield fragments of topics, repetition of certain words across many of them, and not enough similar songs for each topic.
The α (alpha) parameter determines how many topics are extracted from each song ("document"), and η (eta) determines how many words are included in each topic. Both of these parameters can also impact our results significantly.
The mathematical measure of how good a topic is, is called Topic Coherence: the degree of semantic similarity between high-scoring words in the topic (detailed information here). There are a few ways to calculate it; we chose NPMI over the default C_V for reliability reasons (see discussion here).
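As a quick sketch of what NPMI itself measures for a single word pair (the probabilities below are hypothetical corpus frequencies, not Gensim's exact estimation pipeline, which averages over topic words within a sliding window):

```python
import math

def npmi(p_xy, p_x, p_y):
    # Normalized PMI: log(p(x,y) / (p(x)p(y))) / -log p(x,y), bounded in [-1, 1]
    return math.log(p_xy / (p_x * p_y)) / -math.log(p_xy)

# Words that always co-occur reach the maximum of 1
print(round(npmi(0.1, 0.1, 0.1), 9))        # 1.0
# Independent words (p(x,y) = p(x)p(y)) score ~0
print(abs(round(npmi(0.01, 0.1, 0.1), 9)))  # 0.0
```

Words that co-occur less often than chance score negative, so a coherent topic is one whose top words pairwise score well above zero.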
We will create multiple LDA models for our data, with different parameter values, compute their topics' average coherence values, and find the combination which gives us the maximum value.
from gensim.models import LdaMulticore
from gensim.models.coherencemodel import CoherenceModel
def build_lda_model(k, alpha='symmetric', eta='symmetric'):
    # Generate an LDA model for the tested parameter values
    return LdaMulticore(corpus=gensim_corpus, num_topics=k, id2word=dictionary, alpha=alpha, eta=eta,
                        passes=10, iterations=400, chunksize=1000, random_state=28)

def get_coherence(model, measure='c_npmi'):
    # Calculate the average topic coherence for the given model
    coherencemodel = CoherenceModel(model=model, texts=all_lyrics_tokenized, dictionary=dictionary,
                                    window_size=100, coherence=measure)
    return coherencemodel.get_coherence()
We'll tune k, then a (alpha), and then b (eta), each separately:
import pandas as pd
lda_model_for_k = {}
coherences = pd.Series(dtype='float64')
k_test_range = range(4, 13)
for i in k_test_range:
    if i not in lda_model_for_k.keys():
        lda_model_for_k[i] = build_lda_model(i, alpha=0.4, eta=0.1)
        lda_model_for_k[i].save(f'./lda_models/a04b01/k{i}')
    if i not in coherences.keys():
        coherences.at[i] = get_coherence(lda_model_for_k[i])
optimal_k = coherences.idxmax()
coherences.to_pickle('./lda_models/a04b01/coherences.pkl')
coherences.plot(xticks=k_test_range, xlabel='k', ylabel='coherence', marker='o')
<AxesSubplot:xlabel='k', ylabel='coherence'>
import numpy as np
lda_model_for_a = {}
coherences = pd.Series(dtype='float64')
a_test_range = np.round(np.arange(0.1, 1.3, 0.1), 1)
for i in a_test_range:
    if i not in lda_model_for_a.keys():
        lda_model_for_a[i] = build_lda_model(optimal_k, alpha=i, eta=0.1)
        lda_model_for_a[i].save(f'./lda_models/k10b01/a{i}')
    if i not in coherences.keys():
        coherences.at[i] = get_coherence(lda_model_for_a[i])
optimal_a = coherences.idxmax()
coherences = coherences.sort_index()
coherences.to_pickle('./lda_models/k10b01/coherences.pkl')
coherences.plot(xticks=a_test_range, xlabel='a', ylabel='coherence', marker='o')
<AxesSubplot:xlabel='a', ylabel='coherence'>
lda_model_for_b = {}
coherences = pd.Series(dtype='float64')
b_test_range = np.round(np.arange(0.01, 0.12, 0.01), 2)
for i in b_test_range:
    if i not in lda_model_for_b.keys():
        lda_model_for_b[i] = build_lda_model(optimal_k, optimal_a, eta=i)
        lda_model_for_b[i].save(f'./lda_models/k10a03/b{i}')
    if i not in coherences.keys():
        coherences.at[i] = get_coherence(lda_model_for_b[i])
optimal_model = lda_model_for_b[coherences.idxmax()]
coherences = coherences.sort_index()
coherences.to_pickle('./lda_models/k10a03/coherences.pkl')
coherences.plot(xticks=b_test_range, xlabel='b', ylabel='coherence', marker='o')
<AxesSubplot:xlabel='b', ylabel='coherence'>
Now that we have found the model that yields the most coherent topics, let's visualize the inferred topics:
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis
vis_data = gensimvis.prepare(optimal_model, gensim_corpus, dictionary)
pyLDAvis.save_html(vis_data, f'./Lyrics_LDA_optimal.html')
pyLDAvis.display(vis_data)
optimal_model.show_topics(num_words=7)
[(0, '0.082*"baby" + 0.072*"know" + 0.065*"want" + 0.057*"girl" + 0.057*"wanna" + 0.041*"like" + 0.032*"good"'),
 (1, '0.046*"day" + 0.031*"high" + 0.023*"sing" + 0.023*"song" + 0.022*"sky" + 0.021*"sun" + 0.021*"dream"'),
 (2, '0.242*"love" + 0.041*"heart" + 0.038*"need" + 0.033*"know" + 0.027*"way" + 0.027*"hold" + 0.019*"fall"'),
 (3, '0.032*"like" + 0.021*"come" + 0.021*"home" + 0.010*"ride" + 0.010*"drink" + 0.009*"old" + 0.009*"night"'),
 (4, '0.035*"know" + 0.032*"time" + 0.024*"feel" + 0.020*"life" + 0.017*"think" + 0.015*"mind" + 0.014*"leave"'),
 (5, '0.029*"man" + 0.016*"like" + 0.013*"work" + 0.011*"know" + 0.010*"people" + 0.010*"tell" + 0.009*"play"'),
 (6, '0.038*"fuck" + 0.032*"like" + 0.029*"bitch" + 0.028*"nigga" + 0.025*"shit" + 0.021*"niggas" + 0.018*"money"'),
 (7, '0.089*"let" + 0.064*"gonna" + 0.034*"come" + 0.031*"dance" + 0.031*"tonight" + 0.030*"night" + 0.023*"stop"'),
 (8, '0.050*"away" + 0.044*"run" + 0.026*"come" + 0.017*"walk" + 0.015*"die" + 0.013*"free" + 0.013*"rain"'),
 (9, '0.047*"god" + 0.030*"world" + 0.026*"lord" + 0.019*"come" + 0.019*"heaven" + 0.017*"life" + 0.016*"jesus"')]
These topics can be summarized as:
| Topic Num (in pyLDAvis) | Topic ID (in the model) | Topic Labels |
|---|---|---|
| 1 | 4 | Loss, Self-Reflection |
| 2 | 6 | Explicit, Violent |
| 3 | 0 | Light Romance, Passion |
| 4 | 5 | Childhood, Family |
| 5 | 3 | Country, Road, American |
| 6 | 2 | Love, Trust |
| 7 | 8 | Pain, Anxiety, Sadness |
| 8 | 7 | Dance, Party |
| 9 | 1 | Nature, Holidays |
| 10 | 9 | Religious |
This looks much better than the topics we started with.
With these topics we can cluster songs that are similar in their content.
Using the function get_document_topics of Gensim's LDA model, we can get topic probabilities list for each song (the probability that it belongs to each of the extracted topics). These can serve as numeric features for our purpose.
Note: the extracted topic IDs differ from the topic numbers in pyLDAvis, which were only for visualization; the mapping between them is shown in the table above.
import numpy as np
import pandas as pd

def get_all_topics_probabilities(song_bow):
    # The probability that the song belongs to each of the extracted topics
    return [x[1] for x in optimal_model.get_document_topics(song_bow, minimum_probability=0.0)]

all_songs_topics = pd.DataFrame(np.array([get_all_topics_probabilities(song_bow) for song_bow in gensim_corpus]),
                                index=all_lyrics_series.keys())
all_songs_topics.to_csv('data/songs_topics.csv')
all_songs_topics.head()
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0UaMYEvWZi0ZqiDOoHU3YI | 0.117437 | 0.007827 | 0.001165 | 0.042169 | 0.001136 | 0.141234 | 0.149714 | 0.537050 | 0.001197 | 0.001070 |
| 6I9VzXrHxO9rA9A5euc8Ak | 0.183528 | 0.044392 | 0.093599 | 0.567572 | 0.003826 | 0.003209 | 0.003181 | 0.044844 | 0.052725 | 0.003123 |
| 0WqIKmW4BTrj3eJFmnCKMv | 0.543460 | 0.012094 | 0.204901 | 0.001381 | 0.036386 | 0.045766 | 0.111666 | 0.011561 | 0.031535 | 0.001250 |
| 1AWQoqb9bSvzTjaLralEkT | 0.359009 | 0.027817 | 0.045830 | 0.001499 | 0.001441 | 0.045759 | 0.051024 | 0.342494 | 0.123838 | 0.001289 |
| 1lzr43nnXAijIGYnCT8M8H | 0.161864 | 0.007415 | 0.001608 | 0.160031 | 0.233940 | 0.261153 | 0.135508 | 0.034950 | 0.001869 | 0.001661 |
By visualizing all topics distribution against each other we get:
g = sns.PairGrid(data=all_songs_topics, corner=True)
g.map_lower(jointplot_hex)
g.map_upper(hide_current_axis)
g.map_diag(filtered_hist, color='cornflowerblue')
<seaborn.axisgrid.PairGrid at 0x433bf5e0>
There appear to be no correlated topics; all topics add enough information to be considered necessary.
We'll plot the various features sets against each other and see if a correlation exists between any of them.
all_songs_topics_emotions = pd.merge(all_songs_topics, all_songs_emotions, how='inner', left_index=True, right_index=True)
g = sns.PairGrid(data=all_songs_topics_emotions, corner=True, x_vars=all_songs_emotions.columns, y_vars=all_songs_topics.columns)
g.map_lower(jointplot_hex)
g.map_upper(hide_current_axis)
g.map_diag(hide_current_axis)
<seaborn.axisgrid.PairGrid at 0x50d4e910>
It appears that there is only a weak correlation between some of the features.
Now that we have gathered the features of all songs, we can try clustering them.
Our plan is to use unsupervised learning on a portion of the songs in order to create the clusters, and use these clusters as labeled data for an ANN in order to label all other songs for maximal accuracy.
Since our assumption is that playlists should be fairly coherent and homogeneous, we will assess how good our clusters are using the playlists dataset, with the ratio (number of different clusters in a playlist) / (number of songs in the playlist). If this value is small, most of the songs in the playlist belong to the same cluster (or the same few clusters) and therefore could have been predicted by the model, and vice versa.
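A minimal sketch of that evaluation ratio (the track IDs and cluster assignments are hypothetical):

```python
# Hypothetical mapping from track ID to its assigned cluster
song_cluster = {'t1': 0, 't2': 0, 't3': 1, 't4': 0}

def playlist_cluster_ratio(playlist, song_cluster):
    # (distinct clusters in the playlist) / (songs in the playlist);
    # lower means a more homogeneous playlist that the model could have predicted
    clusters = [song_cluster[track] for track in playlist if track in song_cluster]
    return len(set(clusters)) / len(clusters)

print(playlist_cluster_ratio(['t1', 't2', 't3', 't4'], song_cluster))  # 0.5
```

A perfectly homogeneous playlist of n songs scores 1/n, while a playlist with every song in a different cluster scores 1.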
First, we'll try clustering using each feature set individually. Since each of them has 10 features or fewer, we will not reduce dimensionality any further at this stage.
We'll compare several unsupervised methods on a random batch of 7,000 songs.
from random import sample
all_songs_topics_sample = all_songs_topics.loc[sample(list(all_songs_topics.index), 7000)]
all_songs_topics_sample.to_pickle('all_songs_topics_sample.pkl')
First, K-Means algorithm.
We'll use the elbow method to choose the optimal K (number of clusters). The elbow method suggests a good number of clusters based on the sum of squared distances (SSE) between data points and their assigned clusters' centroids. We pick k at the spot where the SSE starts to flatten out and forms an elbow.
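To see the intuition, here is a small self-contained sketch using KMeans' `inertia_` attribute (which is exactly this SSE) on synthetic blob data, constructed so that the elbow lands at k = 3:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(28)
# Three tight, well-separated 2-D blobs -> the elbow should appear at k = 3
X = np.vstack([rng.normal(loc=c, scale=0.1, size=(50, 2)) for c in (0, 3, 6)])

# inertia_ is the SSE between data points and their assigned centroids
sse = {k: KMeans(n_clusters=k, n_init=10, random_state=28).fit(X).inertia_
       for k in range(1, 7)}
# SSE drops sharply up to k = 3, then flattens out -- that bend is the elbow
for k, v in sse.items():
    print(k, round(v, 2))
```

In practice the bend is rarely this sharp, which is why we let yellowbrick's `kelbow_visualizer` locate it for us below.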
from sklearn.cluster import KMeans
from yellowbrick.cluster.elbow import kelbow_visualizer
elbow_vis = kelbow_visualizer(KMeans(random_state=28), all_songs_topics_sample, k=range(20, 400, 20))
import pickle
from sklearn.cluster import KMeans
optimal_kmeans_clusters = elbow_vis.elbow_value_
kmeans = KMeans(optimal_kmeans_clusters, random_state=28)
with open(f"./clustering_models/topics/kmeans{optimal_kmeans_clusters}.pkl", "wb") as f:
pickle.dump(kmeans, f)
We've created our first songs clustering model!
Currently, it is just a 'black box' that receives the topic values of a song and returns an arbitrary cluster number. In order to visualize and explain the clustering, we'll use the SHAP library.
import shap
from IPython.display import IFrame
# A workaround to display certain SHAP plots in HTML
def display_plot_in_iframe(plot, name):
    shap.initjs()
    shap.save_html(f"./resources/{name}.htm", plot)
    return IFrame(f'./resources/{name}.htm', 1000, 400)
import warnings
warnings.filterwarnings('ignore')
clusterer = kmeans.fit(all_songs_topics_sample[:600])
explainer = shap.KernelExplainer(model=clusterer.predict,data=all_songs_topics_sample[:600])
shap_values = explainer.shap_values(all_songs_topics_sample[:600])
# shap.initjs()
# shap.summary_plot(shap_values, all_songs_topics_sample[:600])
Using 600 background data samples could cause slower run times. Consider using shap.sample(data, K) or shap.kmeans(data, K) to summarize the background as K samples.
100%|██████████| 600/600 [52:59<00:00, 5.30s/it]
display_plot_in_iframe(shap.force_plot(explainer.expected_value, shap_values, all_songs_topics_sample[:600], show=False),
                       'force_kmeans100')